-
Notifications
You must be signed in to change notification settings - Fork 665
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New NBFT initramfs module #2620
base: master
Are you sure you want to change the base?
Conversation
Intended to create udev network link files (see systemd.link(5)) from ACPI NBFT tables present in the system, typically called during early boot process. A semi-predictable interface naming is proposed to accurately identify HFI that the interface is set up for. The syntax is as follows: nbftXhY where X is the NBFT table index (filename, defaults to 0) and Y is the HFI record index. Signed-off-by: Tomas Bzatek <[email protected]>
This is a minimal dracut module providing Boot from NVMe/TCP functionality. Consisting of two simple systemd units to set up networking and perform actual connections afterwards. Requires the NetworkManager nm-initrd-generator NBFT support. Signed-off-by: Tomas Bzatek <[email protected]>
In case the `nvme connect-all --nbft` call fails, respawn the service after 10 seconds and try again. Depends on nvme-cli to return non-zero status in case of any of the SSNS records that are not marked as 'unavailable' fails to connect. The respawn cycle is broken by stopping the unit once rootfs is mounted and system switches the root. Signed-off-by: Tomas Bzatek <[email protected]>
5d23b62
to
3fdb4b9
Compare
Very interesting. The timing is kind of unfortunate for me. I'll not be able to do an in-depth review until the beginning of next year. Some quick questions / remarks. I've only glanced about this quickly, so I hope I'm not getting it all wrong.
|
In my mind, the current "API" of nvme-cli wrt NBFT is the JSON format. I am aware that that's subobtimal if you want to build a minimal initramfs, as it requires either a JSON library or something like jq, which is rather big. Still, discuss carefully if we want replace this by something else. |
One thing I'd like to bring up, is that I think we should reconsider the Almost all other commands by nvme-cli are just simple commands which do one thing. I think it should be possible build any complex logic ontop of it, e.g. using scripts. My goal would be to move the missing parts of This is obviously all v3 material. |
Thanks for your first comments, Martin. There's a little more background for this work. I'll try to explain but take it with a grain There's a desire (Fedora, RHEL) to gradually slim dracut down mid-term or even phase it out in the long-term. There's an ambition to use projects like This pull request is a first step towards this goal - minimizing dracut involvement in the boot process. Dracut would stay just as a tool to build the initramfs image, copying required files in place and not much more. Then it would be mostly a matter of changing systemd units and targets dependencies when porting to a different initramfs framework. This should answer some of your questions about "why not dracut". Related to that, there's a tendency to move the particular initramfs functionality towards their native packages and transfer maintenance responsibility along with it. Updating a specific dracut module in a distro and the associated QA cost is often much higher than keeping the module at the edge, i.e. together with the required binaries and libraries. Perhaps this work should be called an initramfs module instead. |
Understood. Similar here. Even now, dracut has no monopoly on building initramfs images. mkosi is relatively new on the stage, but Debian's update-initramfs has existed for years. We've actually discussed about it in the Timberland group, but there didn't seem to be anyone from the Debian/Ubuntu realm interested in NVMeoF-boot. That's why only dracut is currently supported.
Hm, that's exactly what it is today, no?
That makes some sense. The nvme-cli maintainers will need to decide whether they want to take the responsibility of maintaining the dracut module. It's tricky because the module interfaces intimately with both nvme-cli and dracut. But I should look at the code first... |
# /usr/lib/NetworkManager/dispatcher.d/99-nvme-nbft-connect.sh hook | ||
# will respawn nvmf-connect-nbft.service on such occasion. | ||
|
||
[device-nbft-no-ignore-carrier] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused about this config section. Will you implement support for this section, and the directives it contains, in NetworkManager? If not, how is this supposed to work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code quality is eccelent, AFAICS. I also like the design. For me, the open question is mostly which part of this code actually belongs into nvme-cli.
I'd prefer if we could separate out the NetworkManager-specific part more. So that we'd have obvious ways to extend this to other network management tools like wicked.
I believe that libnvme/nvme-cli shouldn't come with a preference for one specific network management tool, even if NM is the "de facto standard" these days.
Likewise, it would be nice not to depend on dracut directly and to just include the dracut code as a PoC, showing how the systemd services need to be complemented in the initramfs. It would be even nicer if we had a mkosi PoC as well.
Wrt dracut specifically, it's up to the nmve-cli / libnvme maintainers to tell whether they are willing to maintain a dracut module.
Two major packages are responsible for this: the new nvme-cli dracut module and | ||
the added NBFT support in NetworkManager. | ||
|
||
## The new dracut 95nbft module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: nothing ages so quickly as the word "new" ...
dracut to activate networking | ||
* dracut runs `nm-initrd-generator` and starts the NetworkManager daemon | ||
* `systemd-udev-trigger.service` renames the network interfaces | ||
* `nm-wait-online-initrd.service` finishes, indicating networking is up and ready |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my remark below about nbft-boot-connect.service
. I am somewhat confused about when exactly network-online.target
is considered to be reached.
|
||
For simplicity and for the time being this systemd unit replaces the traditional | ||
dracut cmdline hook and adds the `rd.neednet=1` `cmdline.d` argument. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What exactly do you mean with "traditional dracut cmdline hook" here? How can you replace it?
## nm-initrd-generator NBFT support | ||
|
||
https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/2077 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be the most important difference wrt 95nvmf
. You create a tool that directly transforms libnvme NBFT data structures into the preferrred format of your Network configuration tool (NM), whereas 95nvmf
converts NBFT to JSON first, then JSON to dracut-style command line arguments, which would then further processed by the conventional nm-initrd-generator
.
Your approach is of course much more efficient, but at the cost of being compatible only with (a future version of) NM.
A similar approach could be taken by wicked or other network management tools.
But I wonder if there might be some middle ground, perhaps we can provide the HFI data in some format that any network management tool can easily convert?
The "dracut command line" format is obviously very clumsy and simplistic.
I'd love to see dracut develop a more capable generic format for describing networking parameters. But I doubt that's going to happen. After all, dracut's format was designed with the kernel command line in mind.
So we could keep 95nvmf
for those network management tools that have no native NBFT support, and take your approach for others. I'd need to discuss with the wicked people if it makes sense for us to write a similar plugin.
attempts are not driven by network link up events but have fixed respawn | ||
interval. This may potentially help the cases where the NIC is slow to | ||
initialize, reports link up yet it takes another 5+ seconds before it's fully | ||
able to send/receive packets. We've seen this issue with some 25Gb NICs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
Ideally we wouldn't react on "link up" events but on events that indicate an L3 connection. But I'm not sure if such events exist... (see below)
Note that 95nvmf
also has a timeout
action. But repeating this in regular intervals is actually a nice idea.
@@ -571,3 +571,108 @@ int show_nbft(int argc, char **argv, struct command *cmd, struct plugin *plugin) | |||
} | |||
return ret; | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So here you add support for one specific network configuration file format, udev's .link
files. It would be possible to derive the .link
files from the JSON output, but this is obviously much faster and more light-weight, especially compared to using jq for the conversion in the current code.
Yet it's not immediately obvious why we'd implement this in nvme-cli, while the conversion of HFI data to .nmconnection
files is done in the Networkmanager plugin. Instead of adding this functionality to libnvme, you might as well have written a separate tool that links to libnvme and creates the .link
files. Why do this here and that there?
The boot process looks roughly as follows: | ||
* `nbft-boot-pre.service` is run, creates udev network link files and tells | ||
dracut to activate networking | ||
* dracut runs `nm-initrd-generator` and starts the NetworkManager daemon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dracut runs
nm-initrd-generator
I fail to see where this happens. Is it an ExecStartPre=
in NetworkManager.service
?
match-device=interface-name:nbft* | ||
|
||
# react on link up/down events | ||
ignore-carrier=no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more intuitive to add globbing support to NetworkManager's "Device List" format, and then simply set something like
[main]
ignore-carrier=except:interface-name:nbft*
Anyway, this is Fedora/RHEL specific. Not sure if other distibutions ship tje Networkmanager-config-server
package. On the libvme side, I'd like to keep the amount of NetworkManager-specific code as minimal as possible, thus I have my doubts whether this belongs here.
I'd prefer if you find some other way to set this up on Fedora. Perhaps simply by documenting that nbft devices should be excempted from the ignore_carrier=
list in NetworkManager.conf
? Or by simply changing the config file that NetworkManager-config-server
ships?
the required block device appears, the wait cycle is ended and the system | ||
continues booting, stopping any queued `nbft-boot-connect.service` respawns | ||
seamlessly. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see below that nbft-boot-connect.service
contains After=network-online.target
. Thus no NBFT target will be connected before this target is reached. Maybe I misunderstand when you consider "online" state to be reached. Is it already reached when just one iBFT interface is configured?
* `nm-wait-online-initrd.service` finishes, indicating networking is up and ready | ||
* `nbft-boot-connect.service` initiates actual NVMe connections | ||
* the dracut initqueue is waiting for specific block devices (rootfs) to appear | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this, and the starting of NM above, are the only 2 things that dracut needs to perform, and therefore it's relatively easy to plug this scheme into mkosi or another initramfs generator, as long as that generator is based on systemd. Nice.
Am I understanding correctly?
I don't have any problems if this is added BUT I'd really want to see some sort of CI testing for this. This stuff will bit rot immediately after merging if we don't have regular upstream testing. |
That's the point, it's hard to write meaningful tests for this. It requires simulating at least a client/server setup with appropriately crafted BIOS settings. While there are client/server tests in the dracut test suite, I'm not aware of a test for NVMeoF/NBFT. |
We have a bunch of exising mock tests. Could we turn this into something more? |
Cc: @mwilck
-- Please check the included README.md first
This started as an experiment from an idea coming from our inhouse dracut and systemd developers. I kinda like the result even though there are many loose ends - read on. There's an unpublished internal plan to slim dracut down in the mid-term and while I can't speak of the details one major argument was something like "dracut was born before systemd" and "systemd nowadays can take over many tasks dracut is doing". Add embedded, automotive and otherwise resource-restricted environments and the push for minimization grows further.
This experiment was about creating specially crafted systemd units injected at the right place with as little dracut involvement as possible. I admit all this work may speak itself a little of a NIH syndrome, still I hope it becomes useful.
There are some loose ends here:
95nvmf
dracut module. Haven't approached dracut upstream yet.nbft
plugin command, at least for now. We can move this somewhere else as the only entry point is the systemd service here. Also the commandline arguments and help strings may need some tweaks. It was strongly suggested to use systemd link files instead of udev rules for interface renaming. However during my experiments with VLANs this appears to have some drawbacks, so this is still undecided and may opt for udev rules instead. TBD.Needs #2614 (already merged).
This work also gives answer for #2179 and includes the files we've kept downstream so far.
And some more wild ideas:
libnvme
NBFT parser in a separate, independent library (just like the -mi one) that would be small and suitable for initramfs, or statically linked. One suggestion was to have native udev support for nbft interface naming within builtin-net_id.c however the parser dependencies were a problem. The ACPI NBFT table is not a fixed-size struct and is not so trivial to parse if somebody wanted to reimplement the parser just to extract the HFI descriptors.